Scalable Simple Random Sampling and Stratified Sampling

Author

  • Xiangrui Meng
Abstract

Analyzing data sets of billions of records has become a routine task in many companies and institutions. In the statistical analysis of these massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove that, with high probability, it succeeds and needs only O(√k) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.
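The accept / reject / wait-list idea described in the abstract can be sketched roughly as follows. This is a minimal single-machine illustration, not the paper's MapReduce implementation. The function name `scasrs_sample`, the parameter `delta` (a target failure probability), and in particular the two threshold formulas are assumptions that the abstract does not state; they are written here in a Bernstein-bound style and should be checked against the full paper before use.

```python
import math
import random


def scasrs_sample(items, k, delta=1e-4, seed=None):
    """Draw k items without replacement via accept / reject / wait-list.

    Each item gets an independent uniform key. Keys below a lower
    threshold are accepted at once, keys above an upper threshold are
    rejected at once, and only the keys in between are stored.
    """
    n = len(items)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    rng = random.Random(seed)
    p = k / n

    # ASSUMED thresholds (not given in the abstract).
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    reject_above = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    accept_below = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))

    accepted = []
    waitlist = []  # (key, item) pairs; expected to stay small
    for item in items:
        key = rng.random()
        if key < accept_below:
            accepted.append(item)
        elif key < reject_above:
            waitlist.append((key, item))
        # Otherwise the item is rejected and never stored.

    missing = k - len(accepted)
    if missing < 0 or missing > len(waitlist):
        # Rare failure case: too many accepted or too few candidates.
        raise RuntimeError("thresholds failed; rerun with a different seed")

    # Fill the remaining slots with the smallest wait-listed keys, so the
    # result equals the k items with the smallest keys overall.
    waitlist.sort(key=lambda pair: pair[0])
    return accepted + [item for _, item in waitlist[:missing]]
```

For example, `scasrs_sample(list(range(1000)), 100, seed=7)` would return 100 distinct values from the input. A stratified variant could, under the same assumptions, call this routine once per stratum with that stratum's own `k`.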

Similar resources

An Evaluation of Stratified Sampling of Microarchitecture Simulations

Recent research advocates applying sampling to accelerate microarchitecture simulation. Simple random sampling offers accurate performance estimates (with a high quantifiable confidence) by taking a large number (e.g., 10,000) of short performance measurements over the full length of a benchmark. Simple random sampling does not exploit the often repetitive behaviors of benchmarks, collecting ma...

Method of Fuzzy Ratio Estimate

This article develops a method of ratio estimation in the fuzzy sense. With both simple random sampling and stratified random sampling, we can obtain the ratio estimate in the usual statistical sense. However, the sampling data may be ambiguous in some uncertain circumstances. To solve this kind of problem, we probe into the simple random sampling and stratified random sampling in fuzzy sense, obtain th...

Sampling Survey of Heavy Metal in Soil Using SSSI

Much attention has been given to sampling design, and the sampling method chosen directly affects the sampling accuracy. The development of spatial sampling theory has led to the recognition of the importance of taking spatial dependency into account when sampling. This text uses the new Sandwich Spatial Sampling and Inference (SSSI) software as a tool to compare the relative error, coefficien...

Comparison of Sampling Techniques on the Performance of Monte-Carlo Based Sensitivity Analysis

Sensitivity analysis is a key part of a comprehensive energy simulation study. Monte-Carlo techniques have been successfully applied to many simulation tools. Several sampling techniques have been proposed in the literature; however to date there has been no comparison of their performance for typical building simulation applications. This paper examines the performance of simple random, strati...

Perfect and Maximum Randomness in Stratified Sampling over Joins

Supporting sampling in the presence of joins is an important problem in data analysis. Pushing down the sampling operator through both sides of the join is inherently challenging due to data skew and correlation issues between output tuples. Joining simple random samples of base relations typically leads to results that are non-random. Current solutions to this problem perform biased sampling o...

Publication date: 2013